feat: make embedding model configurable for non-English search #553
oeniao wants to merge 1 commit into MemPalace:develop from
Conversation
Add an `embedding_model` config option that lets users switch from the
default ChromaDB built-in (all-MiniLM-L6-v2, English-only) to any
HuggingFace sentence-transformers model.
This makes non-English search viable without patching installed files.
For example, setting `paraphrase-multilingual-MiniLM-L12-v2` improves
Chinese/Japanese/Korean recall significantly.
Configuration (priority order):
1. Env var: MEMPALACE_EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2
2. Config: ~/.mempalace/config.json → {"embedding_model": "..."}
3. Default: null (uses ChromaDB built-in, no extra dependencies)
The `sentence-transformers` package is only required when a custom model
is configured; the default path is unchanged and fully backward-compatible.
web3guru888 left a comment
This fills a real gap — the English-only ONNX default is a genuine pain point for non-Latin users, and patching installed files is not a reasonable workaround.
Design is sound. The layered config priority (env var → file → default) is consistent with how the rest of the codebase handles config. The lazy sentence-transformers import is exactly right: no new hard dependency, clean ImportError message telling the user exactly what to do.
One correctness concern worth flagging: _get_embedding_function() creates a new MempalaceConfig() instance on every call (and get_collection() calls it on every operation). That's minor overhead but not a real problem. The larger issue is that get_collection() passes the embedding function to both get_collection and create_collection — but ChromaDB ties the embedding function to the collection at creation time. If a palace was created with the default model and is later opened with MEMPALACE_EMBEDDING_MODEL set (or vice versa), ChromaDB will silently accept the mismatch and your embeddings become incoherent (new writes land in a different region of embedding space from old ones).
The guard against this would be a warning or error when the configured model doesn't match what the collection was created with — or at minimum a note in the docs that changing embedding_model on an existing palace requires re-mining.
Minor: no tests for the new config property or the _get_embedding_function() path. Even a couple of unit tests with a mock model name would catch regressions.
The core feature is right and the backward-compatibility story is solid (null default = unchanged behaviour). The model-mismatch case is the one thing worth addressing before merge — even just a doc note would be better than nothing.
[MemPalace-AGI integration — production stats at https://milla-jovovich.github.io/mempalace/integrations/mempalace-agi/]
This is the right fix for #516 — and the design is clean. The layered config priority (env var → config.json → default) is the correct approach. Env var support means CI/CD and containerized setups can override without touching config files, and the config.json path keeps it user-friendly for interactive use.

A few notes from running non-English content through our integration:

- Backward-compat risk: If a user sets …
- Empty string edge case: …
- Missing …
- The lazy import for …

Overall this unblocks a real pain point for non-English users. LGTM with the backward-compat caveat addressed.
#442 covers the same feature (configurable embedding model via env var / config.json) with a broader scope — it patches all 7 ChromaDB consumer modules, adds mismatch detection, and includes a test suite. Both PRs touch config.py and palace.py for the same purpose, so they'll conflict.
Good callout @mvalentsev — yes, both PRs target the same config surface and will conflict. The scopes are genuinely different, though: this PR (#553) is minimal — a configurable model for the primary search + mine path. #442 goes wider (all 7 ChromaDB consumers + mismatch detection + a test suite).

Both are valid approaches. Whether to merge #442 first, close this one, or merge this as a stopgap probably depends on #442's merge timeline and maintainer preference. If #442 is close to ready, that's likely the better merge path given the broader coverage. If #442 is stalled, this is a fast unblock for non-English users.

Either way, @oeniao it would be worth opening a quick coordination comment on #442 to flag the conflict — or offering to pull mvalentsev's mismatch-detection test idea into this PR if #442 ends up waiting.

[MemPalace-AGI integration — 215 tests, 710 KG entities]
Hey @oeniao — thanks for tackling this, non-English search is definitely a pain point. Just a heads-up: #442 addresses the same problem with broader coverage — all 7 ChromaDB consumer modules are patched (not just …).

One note on the model choice: …

Since both PRs touch the same config surface and will conflict, it probably makes sense to coordinate. Happy to discuss!
ORT_DISABLE_COREML is not a recognized ONNX Runtime environment variable. ONNX Runtime does not expose a global env var to disable individual execution providers — providers are selected per session via the providers argument to InferenceSession. Setting it had zero effect, so the CoreMLExecutionProvider was still loaded on Apple Silicon and the segfault from MemPalace#74 was not actually mitigated.

Replace the misleading setdefault call with a comment that records the history and points at the real fix path. The proper CoreML workaround requires passing preferred_providers at the ChromaDB embedding function level, which is a larger change that belongs in its own PR once the configurable embedding model work in MemPalace#442 / MemPalace#553 lands.

Closes MemPalace#397
Problem
The default ChromaDB embedding model (`all-MiniLM-L6-v2`) is English-only. Users writing in Chinese, Japanese, Korean, or other non-Latin languages get poor search results — match scores go negative and retrieved memories are irrelevant.

Solution
Add an `embedding_model` config option that lets users opt into any HuggingFace sentence-transformers model without patching installed files.

Configuration (priority order):
1. Env var: `MEMPALACE_EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2`
2. Config: `~/.mempalace/config.json` → `{"embedding_model": "paraphrase-multilingual-MiniLM-L12-v2"}`
3. Default: `null` → uses ChromaDB built-in (unchanged behaviour)

Example for Chinese/Japanese/Korean users:
`{ "embedding_model": "paraphrase-multilingual-MiniLM-L12-v2" }`

Changes
- `config.py`: added `DEFAULT_EMBEDDING_MODEL = None` and an `embedding_model` property with env var support (`MEMPALACE_EMBEDDING_MODEL`)
- `palace.py`: added a `_get_embedding_function()` helper that reads config and lazily imports `sentence-transformers` only when needed

Backward compatibility
- Default `null` → behaviour identical to before
- `sentence-transformers` is not a new required dependency — only needed when `embedding_model` is set

Motivation
Discovered this while setting up MemPalace for daily use in Chinese. Had to patch the installed package directly — this PR makes it a proper first-class config option.